Partly melted DNA conformations obtained with a probability peak finding method 



Eivind T0steserQ 

Department of Tumor Biology, The Norwegian Radium Hospital, N-0310 Oslo, Norway 

(Dated: February 8, 2008) 

Peaks in the probabilities of loops or bubbles, helical segments, and unzipping ends in melting 
DNA are found in this article using a peak finding method that maps the hierarchical structure 
of certain energy landscapes. The peaks indicate the alternative conformations that coexist in 
equilibrium and the range of their fluctuations. This yields a representation of the conformational 
ensemble at a given temperature, which is illustrated in a single diagram called a stitch profile. This 
article describes the methodology and discusses stitch profiles vs. the ordinary probability profiles 
using the phage lambda genome as an example. 
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I. INTRODUCTION 

DNA melts by a stochastic formation and growth of 
loopspji) and tails (i.e. unzipping ends). Loop forma- 
tion is also induced in the dense cellular environment 
and is at the heart of DNA biology 3J. For the last 
three decades, numerically calculated properties of the 
DNA melting process have been represented by plotting 
curves of three types: probability profiles, melting curves, 
and Tm profiles. Probability profiles 0, are plots of 
the basepairing probability Ph p {i) or the "upside-down" 
l^Pbp(«) vs. sequence position i. Melting curves 0,0 are 
plots of the hclicity or its derivative vs. temperature T. 
Tm profiles are plots of the basepair melting temperature 
T m (i) vs. sequence position i. Apparently, there has been 
less interest in calculating other properties, one reason 
being perhaps that the ordinary plots outline the main 
features on the experimental side, such as the melting 
curves from UV spectroscopy and differential scanning 
calorimetry. However, there are other interesting prop- 
erties within reach of calculations. For example, what 
are the sizes and locations of loops and how do they fluc- 
tuate? How do distant loops correlate? What are the 
alternative conformations of a region that coexist when 
it melts? What events are predominant and what others 
are rare? In addition to these questions being important 
per se, advances in single molecule techniques 0, S 
provide new types of measurement of the micromechan- 
ical, dynamical, and structural properties, and motivate 
predictions beyond the ordinary curves [9|, llfl lll| . 

In this article, we turn our attention to stitch pro- 
files. A stitch profile is a diagrammatic representation 
of the alternative DNA conformations that coexist at a 
given temperature 12]. Figure ^ shows the four types 
of graphical elements called stitches that go into a stitch 
profile. A stitch represents either a loop, a right tail, 
a left tail or a helical region, as shown, and indicates 
its boundary positions and the ranges of fluctuation of 



'Correspondence: eivindto@radium.uio.no Web server: 
http: //stitchprof iles .uio .no/ 



a) Lo °P 

...11111110000000000000011111111... 

k) Right tail | 

...11111111111000000000000000000-3' 

„ \ I Left tail 

I ^_ 

5' -00000000000000000000111111111... 

d) 

. . .00000011111111111111110000000. . . 

Helix 

FIG. 1: A stitch profile is composed of four types of stitches: 
(a) loops, (b) right tails, (c) left tails, and (d) helices. Loop 
and tail stitches are drawn on the upper side and they span 
regions of opened base-pairs (0's). Helix stitches are drawn 
on the lower side and they span regions of closed base-pairs 
(l's). The horizontal bars indicate fluctuational ranges of the 
0-1 boundaries. 



these positions. In analogy with sewing, where a thread 
coming up and down through the fabric forms a row of 
stitches, any conformation of DNA is an alternating row 
of blocks of open or closed basepairs. A stitch profile in- 
dicates alternative conformations as alternative threads 
(i.e. paths) through the diagram. The aim of this work is 
to develop a method for constructing stitch profiles and 
to discuss stitch profiles vs. probability profiles. 

In the Poland-Scheraga model [Tj|, a state of the chain 
molecule is specified by N binary variables, x±, . . . , xn, 
where the j'-th variable Xj (= or 1) indicates if the j-th 
base in the sequence is basepaired or not with the comple- 
mentary strand. While the classical three types of curves 
are based on calculating the basepairing probabilities re- 
lated to the state Xj of each basepair, a stitch profile, in 
contrast, is based on probabilities of blocks of basepairs 
being in states corresponding to loops, helical segments 
or tails. A stitch profile made "by hand" was introduced 
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in Ref. 12 (where we referred to it as a loop map) in or- 
der to suggest an application of such block probabilities. 
The article described a DNA melting algorithm with two 
important features: a speedup based on multiplication 
of symmetrical leftside and rightside partition functions; 
and the statistical weight of a basepair depending rigor- 
ously on both of its neighbors. These features allow the 
block probabilities to be easily calculated as follows. 

A loop is a consecutive series of O's (melted basepairs) 
bounded by l's at positions x and y, where 1 < x < 
y — 1 < N. The probability of a loop is calculated by 
decomposing the chain in three segments, 

Ploop (•£ , y) = Z xl0 (x)n(y - x)Z ix(y)/pZ. (1) 

Zxw(x) is a partition function characterizing the seg- 
ment [l,x + 1], Zoix(y) is a partition function charac- 
terizing the segment [y — 1,N], Z is the total partition 
function of the whole chain, f2(y — x) is the loop entropy 
factor (a function of loop size), and (3 is related to the 
equilibrium constant of complete dissociation of the two 
strands. A tail is a block of O's that extends to the end 
of the chain. The probability of a right tail from position 
N (the right chain end) to a bounding 1 at position x is 



Pri g ht(aO = Z X w{x)/pZ. 



(2) 



The probability of a left tail from position 1 (the left 
chain end) to a bounding 1 at position y is 



Pleft(y) = Z m x{y)/f3Z. 



(3) 



A helical region is a block of l's bounded by O's or by the 
chain ends. If x is the position of the first 1 in the block 
and y is the position of the last, where 1 < x < y < N, 
then the probability can be written 

Phchx(£,y) = Zxoi{x)E(x,y)Z 10 x{y)/(3Z, (4) 

where Zxoi(x) is a partition function characterizing the 
segment [1, x] and -Ziox(y) is a partition function charac- 
terizing the segment [y, N], The stacking chain function 
S(x,y) is the statistical weight of the block of l's, given 
as 



"(x v) = ( Sl ^ Sl ^nj=x+i s n(j) for x <y 
' s i (x) for x = y, 



(5) 



where sn(x) is the statistical weight of nearest neighbor 
basepairs (a pair of l's), Si(x) is the statistical weight of a 
helix-ending basepair (0 on one side, 1 on the other) , and 
sow(x) is the statistical weight of an isolated basepair (O's 
on both sides). Eqs. Q~Q correspond to Eqs. (12)-(15) 
in Ref. IT2I respectively. 

The block probabilities pi oop , Pright, Pioft, and phoiix 
depend on precisely located boundaries x and/or y. 
But thermal motion causes the boundaries to fluctuate. 
These fluctuations are represented by fluctuation bars 
in a stitch profile. They are not merely attributes like 
"error bars", but rather an essential ingredient. Each 



stitch represents not a single conformation of a region, 
but a grouping of conformations that are supposed to 
be related via fluctuations. In a plot of any of the block 
probabilities as a function of x and/or y, such a grouping 
will appear as a broad peak, and the fluctuation bar(s) 
indicate the extent of the peak. A stitch profile is simply 
a representation of the peaks in the four block probabil- 
ity functions, and the problem of constructing a stitch 
profile is basically a peak finding problem. 

The peak finding problem is important in data and sig- 
nal analysis in diverse areas of science, for example, in 
various types of spectroscopy and image analysis. Peak 
finding has also been used in statistical mechanics to 
define macroscopic states of RNA secondary structure: 
native, intermediate, molten, and denatured states |14| . 
The following issues apply to our case here: A peak's 
size is given by its volume rather than its height. If a 
peak is broad enough, it may have a higher volume than 
another peak, even if it has a lower height. Probability 
peak heights can be very low, so we can not use a height 
cutoff for detecting peaks. There is no erroneous noise in 
our calculated probabilities, so smoothing should not be 
used. Actually, the shapes of peaks are quite irregular 
with "peaks within peaks", so the problem is hierarchi- 
cal peak finding (analogous to hierarchical clustering vs. 
clustering). There may be limited space in a stitch pro- 
file, so it must be chosen which peaks to represent. 

The right and left tail stitches are found by ID peak 
finding in p r i g ht and p\ c tt , respectively, and the loop and 
helix stitches are found by 2D peak finding in pi oop and 
Pheiix, respectively. The challenge is not so much finding 
a peak as deciding its extent. In ID we use an interval — 
the fluctuation bar — to delimit a peak. In 2D we use a 
frame, that is, the cartesian product of the two fluctu- 
ation bars on the x-axis and the y-axis, to represent a 
peak by "framing" it. Let the peak volume p v be defined 
by the probability summed over the interval in ID or the 
frame in 2D. 

This article describes a probability peak finding 
method in ID based on a detailed mapping of the hi- 
erarchical structure. The 2D case is solved by combining 
the ID results for x and y. For the extent of a peak, the 
main idea is to find where the probability has dropped to 
a certain fraction relative to the peak maximum value. 
This fraction is controlled by a parameter to the algo- 
rithm and it determines the widths of the fluctuation 
bars. These widths, in turn, determine the peak volumes 
that can be used for choosing if stitches are included or 
not in the stitch profile. 



II. THE METHOD 

Minus the logarithm of a probability is an energylike 
quantity. Using this, we transform probability peak find- 
ing into finding the wells or lakes in a (pseudo)energy 
landscape. A peak with a certain ratio between the 
probability at the maximum and the probabilities at the 
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edges corresponds to a lake with a certain depth in an 
energy landscape. The analogy to mountain landscapes, 
lakes, ponds, etc., is standard in statistical mechanics 
[l5|. We use it here to redefine the stitch profile prob- 
lem of peak finding to be a lake finding problem in four 
energy landscapes — two ID landscapes: 



Ei(x) = -log 10 p rig ht(z), (6) 

E 2 {y) = -log 10 Pieft(j/), (?) 

and two 2D landscapes: 

E 3 (x,y) = -log 10 pi oop (x,y), (8) 

Ei{x,y) = -lag 10 p heiix (x,y). (9) 



A. Peak finding method in ID 

The ID method is described here using E\(x) as an 
example, but £2(2/) is treated the same way. Of all the 
possible lakes that can be created by filling water into the 
various wells, we restrict ourselves to considering only a 
finite set of representative lakes. Let be the set of 
sequence positions a at which Ei has an extremum. Min- 
ima and maxima in are alternating along the a;-axis. 
To each element a £ Vfi we associate a lake L(a) in the 
landscape, see Fig. |5J The altitude of the surface (wa- 
ter level) is E\ (a) and the lake surface spans the interval 
L(a) = [L L (a),L R (a)} given by 



a e L(a), (10a) 

\/x € L(a) : Et(x) <Ex(a), (10b) 

£i(Ll(o) - 1) > E 1 (a) or L L (a) = 1, (10c) 

Ei(L K (a) + 1) > Si (a) or L R (a) = N. (lOd) 



When a is a minimum, in most cases L(a) = {a}. When 
a is a maximum, the lake L(a) is nearly split in two ad- 
jacent lakes, divided at position a where the local depth 
becomes zero. However the corresponding probability 
peak is not split in two by a zero probability, so we con- 
sider L(a) as one lake. The bottom (3a of a lake L(a) is 
defined as the position with lowest energy in the lake, 

/3a = arg min Ei(x) (11) 

The depth D(a) of a lake L(a) is defined as the en- 
ergy difference between the surface and the bottom: 
D(a) — i?i (a) — Ei((3a). Some lakes are contained inside 
deeper lakes: L(a) C L(b). This partial ordering of lakes 
in Wi defines a hierarchical structure |l6j . Assuming that 
both the leftmost and the rightmost (on the x-axis) el- 
ements in are minima (terminal maxima could just 
be excluded), the elements in can be considered as 
the nodes of a binary tree. The root p\ of the tree is 
the global maximum and its lake L(p\) spans the entire 
sequence (or almost). 



1. Pedigree ordering 

Imagine a walk along the branches of the binary tree. 
In order to orient itself, the walker needs "roadsigns" at 
each node a that point the directions to the root p\ and 
the bottom (3a. This imposes a structure similar to a 
pedigree (i.e. tree of ancestors) as illustrated in Fig. [21 
Each node a p\ is connected upwards in the direction 
of the root to a unique node aa called the successor of a. 
This means L(a) C L(o~a). Each node a corresponding to 
a maximum is also connected downwards to two parent 
nodes: a father node na in the direction of the bottom; 
and a mother node p,a in the other direction. This means 
that the father is the parent with the lowest bottom, 
Ei(j3ira) < Ei(j3pa), and that f3ira — f3a. But it does not 
imply that D(ira) > D(pa). In contrast to an offspring 
tree, parents are located in the direction away from the 
root, rather than the reverse. Define the set of successors 
of a node a: 

S(a) = {a,cra,a 2 a,a a, . . . , pi}. (12) 

The set £(a) traces a path from a up to the root. Define 
the set of ancestors of a node a: 

A(a) = {b e Vfi|a e £(&)}. (13) 

The set A (a) is the subtree that has a as its root or top 
node. Each node a G belongs to a unique paternal 
line, 

11(a) = {<pa, . . . , a, 7ra, . . . , /3a}, (14) 

which is a maximal set of nodes that are related through 
a series of fathers. The series ends at their common bot- 
tom node Pa and, oppositely, it begins at a node, called 
the full node ipa, which is not itself a father. Each node 
corresponding to a minimum is the bottom of its paternal 
line. And each node which is either a mother or the root 
pi is the full node of its paternal line. For a node a that 
is both a minimum and a mother, 11(a) = {a}. The term 
"full" stems from filling water into a bottom: The pater- 
nal line indicates successively deeper lakes until the full 
lake is reached. The successor of the full lake belongs to 
another paternal line and corresponds to another bottom 
being filled. 



2. The maxdeep algorithm 

The lake finding task at hand is to search through the 
set of lakes corresponding to the set ^i, and select the 
lakes to be represented in a stitch profile. A possible solu- 
tion is the maxdeep algorithm. With a given parameter 
-Dmax, the algorithm finds all nodes a £ $i where 

D(a) < Ana X (15a) 
D(aa) > Anax or a = pi. (15b) 
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FIG. 2: (Color online) E\(x) is plotted for a 70 bp sequence to illustrate the pedigree ordering of lakes in an energy landscape. 
Lakes corresponding to each local maximum are shown as horizontal dashed (blue) lines. Arrows indicate the binary tree 
starting from the root pi = 4. Paternal lines are connected series of fathers shown as solid (red) arrows. The node a = 16, 
for example, has the lake L(a) — [15,26], bottom /3a = 20, depth D(a) ~ 0.8, father ira = 18, mother pa = 15, and successor 
a a = 28, as shown. 



In words, this means that they should be as deep as pos- 
sible without exceeding the maximum depth. In many 
cases, it can be expected that the depth increases in just 
a small step from a node to its successor, but not for full 
nodes that often have successors that are much deeper 
than themselves. Therefore, some of the selected lakes 
will have depths close to D max , while others (the full 
ones) can be more shallow. The maxdeep algorithm 
is shown in Fig. [3] It is not necessary to evaluate all 
a G v&i. Instead, the MAXDEEP algorithm involves a 
"tree climber" c that basically climbs up the paternal 
line II(pi) of the input node i = p\, starting from the 
bottom f3p\, until it exceeds Z? ma x, and then takes one 
step down again. At that point, c fulfills the criterion in 
Eq. (|15fl . while all other nodes in S(c) and A(c) do not. 
Subsequently, the algorithm calls itself recursively with 
mothers of S(c) as input nodes, in order to explore other 
paternal lines. 



The output of the maxdeep algorithm is illustrated in 
Fig. 0] Lakes have depths between zero and £> max and 
they cover a large fraction of the landscape. The figure 
shows that the effect of increasing D max is that some lakes 
become wider and deeper, some lakes merge, and some 
lakes (corresponding to full nodes) remain unchanged. 




c->list 



I (p.c) ->list 



FIG. 3: The maxdeep algorithm that finds stitches with 
depths below -D max . For a given input node i, it returns a list 
of those nodes in A(i) that fulfill the criterion in Eq. 1151 . 
c is the tree climber, — > means push (i.e. "put on a list"), 
and I(fic) is the output of a recursive call to the algorithm 
itself with pc as the input node. The diagram can be read in 
conjunction with Fig. [5] 
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Defining two ID landscapes, 



■M^) = -log w Pb.elix(x,N), 

E e(y) = -logi Phciix(l,2/), 



(17) 
(18) 



we can write E 4 (x,y) — E§(x) + E§(y) + const. This 
decoupling of x and y allows us to analyze the lakes in 
E 4 , based on ID analysis of E§ and Eq and their binary 
trees \&5 and tyg. 

Let a£ $5 and b G ^fe and consider the frame L(a) x 
L(b). Does such a frame enclose a 2D lake in £4? At 
least, the frame should cover a region of E 4 . The frame is 
said to be above the diagonal if Lr(o) < L\^(b), such that 
all (x, y) £ L{a) x L{b) are points in the helix landscape. 
Because of the decoupling, the minimum in a frame is 

arg min E 4 (x,y) = (/3a, (3b). (19) 

(i,j)£L(a)xL(fi) 



FIG. 4: Lakes in ID found by the maxdeep algorithm. The 
two plots show the same region of an energy landscape. The 
first plot shows the lakes found by the algorithm using Am = 
3 and the second plot shows the lakes with D ma x = 6. 

3. The largest peaks 

Some of the lakes found by the maxdeep algorithm 
are probably not very significant, having low depths and 
low peak volumes at the same time. A second selection 
process is required before we finally get the stitches for 
the profile. Since the MAXDEEP algorithm does not con- 
sider the peak volumes p v , these can be considered in the 
second selection. If we want to select the largest peaks, 
for example, this can be achieved by having a cutoff as 
a second parameter p c and select only nodes a with peak 
volume p v (a) > p c . 

B. Peak finding method in 2D 

While a ID lake is completely described as an interval 
[-Ll(o), lakes in the 2D landscapes £3 and £4 

have complicated contours and perhaps islands in the 
interior [16|. Fortunately, we only need to know a 2D 
lake's extents on the x-axis and the y-axis, in order to 
enclose it in a frame, which is needed in a stitch profile 
(cf. section HJ. 

1. The helix landscape 

Consider the lake finding problem in the helix land- 
scape E4 . Although isolated basepairs are allowed [x = y 
in Eq. we will ignore those instances for convenience, 
and require 1 < x < y < N in the landscape. It follows 
from Eqs. (gj and JSJ that 

Pheiix(l, y)Pheiix(a;, N) = _phoiix(a;, y)Pheiix(l, N). (16) 



Consider the energy barriers seen from {(5a, (3b): 

AE 4 {x,y) = E 4 (x,y)-E 4 ((3a,(3b) 

= E 5 {x) - E 5 {(3a) + E 6 (y) - E & {(3b) 
= AE 5 (x) + AE 6 (y). (20) 

Using Eq. (pbl . we find that AE 4 (x,(3b) < D(a) for 
all x E L(a), and using Eqs. I|10cl) and (|10d|l . we find 
that AE 4 (x,y) > D(a) just outside the west and east 
side of the frame. This means that L(a) is the extent 
on the x-axis of a 2D lake with depth D(a) and bottom 
(/3a, (3b). And vice versa, L{b) is the extent on the y-axis 
of a 2D lake with depth D{b) and bottom ((3a, (3b). If 
D(a) D(b), then a 2D lake with bottom ((3a, (3b) must 
have depth D < min{D(a), D(b)} to be confined by the 
frame, and it will not extend to all four frame sides. A 
lake with depth D > min{D(a), D(b)} is not confined 
and may not even have its bottom inside the frame. If 
D(a) = D(b) then the frame L(a) x L(b) is exactly the 
extent in both dimensions of a 2D lake with that depth. 

Unfortunately, we can not expect to find a's and 6's 
with equal depths. In order to best approximate 2D 
lakes, we want instead D(a) and D(b) to be as close as 
possible, that is, none of ira, aa, nb or ab should have 
depths in between D(a) and D(b). This can be formu- 
lated as two conditions: we say that (a, b) is "cr-above" 
if 



D(ab) > D(a) or b = p 6 , (21a) 

D(aa) > D(b) or a = p 5 , (21b) 

and we say that (a, b) is "7r-below" if 

D(nb) < D(a) or b = (3b, (22a) 

D(ita) < D(b) or a = (3a. (22b) 



Frames that are cr-above, 7r-below, and above the di- 
agonal are good representations of lakes with depth 
min{Z) (a), D(b)}. Let us examine the three conditions 
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one by one. First, define the set r 4 of frames that are 
a- above, 



{(a,b) S * 5 x * 6 |(a,6) is a- above}. 



(23) 



Eq. J2H) shows that (p5,pe) G T 4 . And (a, 6) 6 T 4 if 
a and 6 both correspond to minima, because Eq. H21(l 
is true with D{a) = and D(b) = 0. This means 
{13a, (3b) € T 4 for any a € * 5 and b G * 6 . Just as ID 
lakes are hierarchically ordered, so are frames. A frame is 
contained in another frame, L(a) x L(b) C L(c) x L(d), if 
a G A(c) and 6 6 A(d). It turns out that the elements of 
r 4 are the nodes of a binary tree with (ps, pg) a s its root. 
We call r 4 the frame tree, and it is a kind of product tree 
between the trees ^5 and The successor of a node 
(a,b) ± (p5,Pe) is 



cr(a, b) 



(era, b) if D(ab) > D(aa) or b = pa 
(a, <r6) if D{aa) > D(ab) or a = ps 



(24) 



Each node (a, 6) G T 4 , where a and b are not both min- 
ima, has two parent nodes. We define the bottom of a 
node as (3{a,b) = {(3a,f3b), which enables us to distin- 
guish the two parent nodes as a father, 



7r(a, b) 



(ira,b) if D(a) > D(b) 
(a, 71-0) if D(a) < D(b), 



with /3ir(a, b) = (3{a, 0), and a mother 



p(a, b) 



(fia,b) if D(a) > D(b) 
{a,pb) if D(a) < D{b), 



(25) 



(26) 



with (3p(a, b) ^ (3{a, b). This gives the frame tree a pedi- 
gree ordering with paternal lines, etc., just as in ID. 

Some frames in T 4 are not above the diagonal, such as 
the root frame (p5,pe) that spans almost the entire se- 
quence in both dimensions. Next, define the set of frames 
that are cr-above and above the diagonal: 



^4 = {( a -b) G r 4 |Lft(a) < Ll(&)}. 



(27) 



'J 4 is organized as a number of disjoint binary trees, ^4 = 
[Jj A(aj,bj), each one being a subtree of T 4 . The top 
node (a,j,bj) of the j-th subtree is above the diagonal, 
while its successor cr(aj,bj) crosses the diagonal. And 
lastly define the set of frames that are cr-above, 7r-below, 
and above the diagonal: 



T 4 = {(a,b) G ^4 1 (a, b) is 7r-below}. 



(28) 



This set also consists of disjoint trees, but they are not 
binary, nodes can have more than two parents. 

If we do not require 7r-below and consider the larger 
set X P 4 , then such frames are still good representations 
of lakes. A computational advantage of this is that 
each subtree A(aj,bj) in <i? 4 can be searched with the 
MAXDEEP algorithm, by using its top node (a,j , bj ) as 
the input i. The E4 lake finding problem is solved by 
finding frames in , 5 4 using the MAXDEEP algorithm in 



this manner, followed by a second selection based on 
the cutoff p c , which yields the helix stitches for a stitch 
profile. For the maxdeep algorithm we must define 
the depth of a frame. A possible depth definition is 
D(a,b) = max{D(a), D(b)}. Using this, the maxdeep 
algorithm finds frames (a, 6) with D(a) < 
D{b) < D max . 



2. The loop landscape 

Consider the lake finding problem in the loop land- 
scape E 3 . It follows from Eqs. 0-® that for 1 < x < 
y — 1 < N, 

E 3 (x,y) =E 1 (x)+E 2 (y) -log 10 Q(y-x)+ const. (29) 

x and y do not decouple in £3. But log 10 f2(y — x) 
varies slowly enough to be considered constant as an ap- 
proximation. With this assumption, it turns out that 
an analysis parallel to the one for the helix landscape 
gives reasonable stitch profiles. There are only minor 
differences between the loop and helix cases. For ex- 
ample, a loop frame is said to be above the diagonal if 
Lpt(a) + 1 < Li,(b). Here is a brief outline: We define the 
frame tree 



T 3 = {(a, b) G *i x \E» 2 |(o,6) is cr-above}. 



(30) 



The root of T3 is {p\,p2)- By replacing p§ with p\ and 
P6 with P2 in Eqs. (|21|l and 124fl - H26[) we define cr-above, 
successors, fathers, and mothers in T3. The set of cr- 
above frames that are also above the diagonal is 



*3 = {(a, b) G r 3 |L R (a) + 1 < L h (b)}. 



(31) 



And we search ^3 for frames using the top nodes of the 
subtrees and the maxdeep algorithm. Subsequently, the 
loop stitches are found by a selection based on the cutoff 

Pc- 



C. Stitch profile data 

A mere calculation of the block probabilities [cf. 
Eqs. Q-Q] gives 0{N 2 ) small numbers which are infor- 
mation in a fragmented form. In principle, these block 
probabilities can be obtained as functions of x and/or 
y using the Poland-Scheraga model [l3|. the Peyrard- 
Bishop model 01 > or something else. The peak find- 
ing method is applied for putting this information in the 
more useful form of a stitch profile, which is data of size 
O(N) only. A stitch profile is a set of stitches of the four 
types in Fig. ^ As we have seen, each fluctuation bar 
corresponds to a lake in a ID landscape. Figure [S] shows 
how the two fluctuation bars of a loop stitch (a, 6) span 
the lake intervals L(a) and L(b), and how the diagram 
also indicates the position of the lake bottom (f3a,(3b), 
where the probability peak has its maximum. Two addi- 
tional quantities are associated with a stitch: the depth 
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FIG. 5: Eight quantities characterize a loop stitch. 

D(a, 6) and the peak volume p v (a,b). These quantities 
can also be illustrated, for example, by labeling the stitch. 
Thus, eight quantities are associated with each loop or 
helix stitch, but only five quantities for left or right tail 
stitches that only have one fluctuation bar. 

III. DISCUSSION 

This section discusses different aspects of stitch pro- 
files, what information they represent, and the choice of 
parameters and algorithm. The 48 kbp phage lambda 
genome (GenBank accession number NCJ301416) is used 
as a test sequence, to illustrate stitch profiles and prob- 
ability profiles rather than the melting behavior vs. biol- 
ogy of lambda . All stitch profiles were calculated for 
the whole 48 kbp sequence, but the interesting features 
arc viewed in windows of length 1 kbp-20 kbp. The stitch 
profiles in full length are better viewed on a computer 
than in print |3fij . Partition functions were calculated in 
the Poland-Scheraga model using the algorithm in Ref. 12 
with (3=1. The parameter set of Blake & Delcourt Q 
at [Na + ] = 0.075 M was applied, with the loop entropy 
factor Vl(y — x) — (j\2(y — x) + 1]~ Q reparametrized by 
Blossey & Carlon |l<| using a = 2.15 and a = 1.26 x 10 . 

A. Alternative conformations 

As previously stated, a stitch profile shows a number 
of alternative conformations that coexist in equilibrium 
at a given temperature. It does so in two ways: (1) each 
stitch represents a fluctuational variation of its bound- 
aries, and (2) alternative rows of stitches represent alter- 
native series of loops and helices along the chain. Fig- 
ure shows a short sequence window of a stitch pro- 
file, in which there are nine loop stitches and nine he- 
lix stitches. Note how the fluctuation bars and the lake 
bottoms of loops often coincide with those of helices, al- 
though they were calculated independently of each other. 
This is expected because a helix must begin where a loop 
or tail ends, of course, so coinciding fluctuation bars re- 
flect the same boundary fluctuation. A row of stitches 
is a series of stitches connected in a chain by coincid- 
ing boundaries, and in the diagram, it forms a contin- 
uous path or thread, which alternates between the up- 
per and lower side. There are often several stitches to 
choose among at a given boundary, resulting in a com- 
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FIG. 6: A stitch profile (in the box) represents a number of al- 
ternative conformations. In this example, there are seven pos- 
sible conformations listed schematically below the box. The 
parameters are D ma x = 3, p c — 0.02 and T = 81.9° C and the 
sequence window is 35.5 kbp-37 kbp. 

binatorial number of alternative rows of stitches. Each 
row of stitches corresponds to a specific conformation 
(apart from the fluctuational variation) of a region that 
is much longer than the regions specified by the individ- 
ual stitches. The stitch profile in Fig. is aligned with a 
schematic list of the seven alternative conformations cor- 
responding to the possible rows of stitches. These alter- 
natives do not represent the only possible conformations 
in that window — strictly speaking, any conformation has 
a non-zero probability — but they represent the most sta- 
ble conformations in terms of probability peak volume 
and depth. 

Note that Fig. also illustrates that stitches are sorted 
and stacked vertically in the diagram according to their 
lengths fib — [3a. This is for aesthetic reasons only. There 
is no quantity associated with the vertical axis. Fluc- 
tuation bars are also placed on different levels to avoid 
overlap. 



B. Correlations and cooperativity 

DNA cooperativity is the presence of certain long- 
range correlations, that should be distinguished from the 
long-range interactions embedded in the loop entropy 
factor Q(y — x). In a probability profile, cooperativity 
appear as the characteristic plateaus that indicate the 
tendency of blocks of basepairs to "melt as one" — being 
either all 0's or all l's. Basepairs within such a coopera- 
tive region are strongly correlated. This aspect of coop- 
erativity is also prominent in a stitch profile, where the 
block organization is shown more specifically. For com- 
parison, Fig. [7] shows a stitch profile and a probability 
profile of the same piece of DNA. Stitches in the diagram 




FIG. 7: (Color online) Comparison of a stitch profile and a probability profile, both calculated at T ~ 90° C where the helicity 
is O = 0.1. The curve in the middle (in red) is the probability profile 1 — p^ p (x) and it varies between and 1 (vertical axis 
not shown). Each stitch is labeled with its peak volume p v in percent. The parameters are D ma x = 3 and p c = 0.02 and the 
sequence window is 11 kbp-17.4 kbp. 



are labeled with their peak volumes. As can be seen, 
plateaus in the probability profile can be identified with 
one or more stitches where there are correspondences in 
position and between the peak volumes and the plateau 
height relative to surrounding plateaus. 

In addition to the strong correlations inside a coop- 
erative region, there are weaker correlations over longer 
distances. For example, a certain loop at one site can in- 
fluence what loops that can exist at a distant site. Infor- 
mation about such correlations across one or more stitch 
boundaries may be derived from the alternative rows of 
stitches that indicate the possible multi-loop conforma- 
tions. However, this has yet to be developed formally. 

It can be expected that stitch profiles represent DNA 
cooperativity better than probability profiles, when con- 
sidering the types of probabilities involved: Correlations 
between basepairs xt and Xj can be formulated using 
conditional probabilities p(xi\xj). Conditional probabil- 
ities can not be derived from a probability profile p(xi) 
alone, but they can be derived using block probabilities 

p(Xi . ..Xj). 

Figure [7| also illustrates that some stitches have a dead 
end, that is, a boundary that does not coincide with other 
stitches' boundaries. A row of stitches can not be contin- 
ued at a dead end. Stitches with dead ends typically have 
low peak volumes. Dead ends exist because the continu- 
ation of a row splits up in several stitches that all have 
peak volumes below the cutoff and are therefore not in- 
cluded. It is possible to make stitch profiles without dead 
ends by replacing the simple cutoff selection with some 
other appropriate method. 



C. The parameters D max and p c 

We have seen in Fig. 0] how D max controls the lakes 
found by the MAXDEEP algorithm. An effect of increas- 
ing -D max is to increase the widths of the fluctuation bars 
(i.e. lakes) in a stitch profile. Another effect has to do 
with the hierarchical merging of lakes: In Fig. \7\ the 
fluctuation bars correspond to the sloping parts of the 
probability profile. However, these sloping parts contain 
smaller plateaus. Decreasing Z? max can reveal this inter- 
nal structure, by splitting a stitch into several stitches 
with smaller fluctuation bars. 

Figure [S] illustrates the role of p c . When p c is lowered, 
an extra set of stitches is added to the stitch profile. p c 
does not modify the stitches as found by the maxdeep al- 
gorithm, it only controls how many of them are included 
in the diagram. The extra stitches represent rare events 
(low probabilities), and they provide a more finegrained 
picture like higher order terms in an expansion. The 
number of stitches in a stitch profile directly depends on 
Pc but it also depends on -D max indirectly, because the 
peak volumes depend on the widths of lakes and frames. 

The stitch profiles made using this article's methods 
depend on the values of D max and p c . The choice of 
these values depends on what we want to see. One could 
seek for alternative methods that are parameterfree or 
only use one parameter by applying, for example, an op- 
timization scheme and some optimality criterion. But in 
my opinion, reducing the number of parameters would 
only hide the fact that the peak finding task involves two 
different types of choice: (1) lumping together related 
events into a peak and (2) selecting what peaks to in- 
clude. The cutoff selection method using p c is simple to 
program, requires only little computer power, and can 
reuse data from a stitch profile that has a lower cutoff. 
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FIG. 8: (Color online) Extra stitches are included when the 
cutoff p c is lowered. The first stitch profile (p c = 0.02) con- 
tains stitches with peak volumes p v > 0.02. The second stitch 
profile (p c = 0.001) contains the same stitches as the first 
one (in blue), plus an extra set of stitches with peak vol- 
umes 0.001 < p v < 0.02 (in red). The other parameters are 
n max = 3 and T ~ 80.6° C and the sequence window is 24 
kbp-27.7 kbp. 



Therefore, p c can be used in practice for "finetuning" by 
trying out different values iteratively. In this way, we 
can make a stitch profile with a certain total number 
of stitches. Or a stitch profile with a certain maximum 
height of the stackings of stitches in the diagram, which 
would limit the visual complexity (cf. Fig- EI> • 



D. Ensemble representation 

The alternative conformations represented by a stitch 
profile are few in numbers compared to the total number 
2 N of possible conformations. But they may constitute a 
considerable fraction of the ensemble in terms of statis- 
tical weight. Is this fraction big or small? A direct an- 
swer would be obtained by comparing the total partition 
function Z with the partition function restricted to the 
stitch profile conformations. Instead, we take a graph- 
ical approach that relates peak volumes to basepairing 
probabilities. A stitch profile provides upper and lower 
bounds of the corresponding probability profile. For ex- 
ample, the presence of a loop stitch as in Fig. [S] implies 
that basepairs in that region are melted with probabil- 
ity greater than the peak volume. Or more precisely, 
1 — Php(x) > p v {a,b) for Ln(a) < x < L^(b). Stitches 
that overlap are mutually exclusive, so we can sum over 
stitches to obtain bounds as follows. Recall that an indi- 
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FIG. 9: (Color online) The probability profile 1 — p^ p (x) of 
Fig-Qis plotted here (in red) with its upper and lower bounds 
(in blue), f — pi ow (:r) and f — p up (x), obtained from the stitch 
profile. 



cator function is defined as 

J MW | o otherwise. [6) 
Then pi ow (x) < p hp (x) < p up (x), where 

Pu P {x) = I- / [ix L (a)-i](^K(a) 

left tail a 

~ zZ / [iR(a)+i,i L (b)-i]( a; K(a,6) 

loop (a, b) 

~ ^[L R (a)+i,JV](a;K(a), (33) 



and 



right tail a 



/'lowO) = I [iR(a),i L (fe)]( a: K(a,&)- 
helix (a, b) 



(34) 



Figure shows the probability profile I — Ph P {x) from 
Fig. [7| together with its two bounds calculated using the 
stitch profile in Fig. and Eqs. O and (p^fl. The three 
curves are quite close. The two bounding curves con- 
sist of vertical and horizontal lines because of the block 
nature of the stitch profile. Given this constraint, they 
almost come as close as possible to the probability pro- 
file. The probability profile can apparently be reproduced 
from the stitch profile with only a small error. This sug- 
gests that the conformations represented by the stitch 
profile constitute a majority of the ensemble in terms of 
statistical weight. 



E. Predicting metastable conformations 

A depth between zero and Z? ma x is associated with each 
stitch in a stitch profile. A depth relates to the landscape 
picture, but it actually characterizes the probabilities of 
boundary positions: The probability of a boundary in 
the fluctuation interval L{a) to be located at [3a is 1Q D ^ 
times greater than the probability of being at L^(a) or 
i R (a). 

The landscape picture of lakes confined by barriers has 
been borrowed in this article for the purpose of probabil- 
ity peak finding. This should be distinguished from the 
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usual purpose of describing the nonequilibrium behavior 
of systems that are "trapped" in metastable states . 
Nevertheless, an idea of dynamics is implicit when talk- 
ing about fluctuations. Are the fluctuation bars related 
to actual fluctuations over time? 

In a dynamics interpretation, a ID lake L(a) describes 
the ranges of the diffusion of a boundary position on 
timescale r oc 10 D ( a \ A stitch profile would then pre- 
dict the ranges of fluctuations that can be observed in 
an experiment during time r oc lO^ 1 "** with an empiri- 
cal constant of proportionality. However, this timescales 
interpretation is preliminary until the rates of nucleation 
events creating new loops, tails, and helices are accounted 
for. If the depths are related to timescales, a possible 
application of stitch profiles is to predict metastable con- 
formations that could play an important role in intracel- 
lular DNA or other nonequilibrium situations. Stitches 
that have large depths (i.e. long-lived) and small peak 
volumes (i.e. rare) are expected to indicate metastable 
conformations. They can be easily found using a slightly 
modified cutoff selection method. It is more difficult to 
detect such conformations in a probability profile because 
of their low probabilities. 

F. Applications 

In conclusion, peaks in the block probabilities 
[Eqs. 0-(@J] can be found and represented as stitches 
in a stitch profile. A stitch profile indicates the sizes and 
locations of loops, tails and helical regions, their proba- 
bilities and "depths" , and how they fluctuate. Multi-loop 
conformations can be derived from the alternative rows 
of stitches, which may show correlations between distant 
stitches. A stitch profile thus predicts the conformations 
of partly melted DNA and can account for a majority of 
the conformational ensemble in terms of the basepairing 
probabilities. 

Stitch profiles are motivated by the general idea that a 
better prediction of DNA's conformational behavior may 
contribute to a better understanding of DNA's functional 
behavior, when there is a structure-function relationship. 
Strand separation is at the heart of various processes that 
occur in chromatin and chromosomes lj. The question 
is, what role does the sequence dependence of loop sta- 
bilities as predicted for DNA melting play in biology? 
It is reasonable to believe that the low stability of AT- 
rich regions is important in origins of replication Q] and 
in transcription initiation pol |. Furthermore, bioinfor- 
matic evidence points at more extensive and not yet ex- 
plained correlations in some genomes between the pre- 
dicted melting properties and the organization along the 



sequence of exons, introns, and other genetic elements 
EE El Hi- There are different hypotheses. One 

view is that DNA mainly is digital information storage 
and that such correlations is a secondary effect reflect- 
ing, for example, varying compositions of proteins |25j| . 
Another view is that DNA is also physical and that loop 
stabilities and/or other sequence dependent biophysical 
properties of DNA contribute actively in different biolog- 
ical mechanisms. Some more or less speculative examples 
are: recombination and crossover, sister chromatid adhe- 
sion, DNA-protein interactions, and intron insertion |'2(j| . 
DNA conformational changes in a cell are not driven by 
temperature changes, but rather by molecular forces and 
interactions. Why then are DNA melting predictions rel- 
evant? The Poland-Scheraga model deals with in vitro 
conditions that are far from the conditions in chromatin: 
crowding [27] is not accounted for in the loop entropy 
factor, condensation and protein interactions are miss- 
ing, and topology and chromosomal geography is not 
accounted for. Fortunately, the correlations found by 
Yeramian and others suggest a robustness of the pre- 
dicted melting properties. Stitch profiles may also apply 
to intracellular DNA. 

As mentioned in the Introduction, the development of 
stitch profiles is also motivated by new single molecule 
techniques, in which some of the predicted properties 
could be measured. For example, it would be interest- 
ing to compare with measurements of bubble sizes and 
their statistical weights [8|, positions and stabilities of 
"tails" 0, and bubble lifetimes |g. As explained in the 
previous section, it is an open question how the depths 
of stitches relate to the lifetimes of the corresponding 
conformational features. 

Stitch profiles may supplement the use of ordinary 
melting profiles in the design and interpretation of in 
vitro experiments such as gel electrophoresis [2^, |2^ and 
in the design of probes and primers for PCR and mi- 
croarrays |30j. For example, stitch profiles emphasize the 
ensemble aspect and may thereby predict some features 
of gel experimental data. For short DNAs, however, it 
is relevant to consider also secondary structure, slippage 
and mismathes [3ll l32] | , that are not accounted for in the 
Poland-Scheraga model. 

A web server for computing stitch profiles has been 
made available at http://stitchprofiles.uio.no |33|. 
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tinguished from looping in which the helix as a whole 
bends back on itself. 

See EPAPS Document No. ? for an online view of the 
stitch profiles of lambda DNA at different temperatures 



